What types of crime occur in Vancouver based on time and location of criminal activities from 2003 - 2019 and what are the best predictors of a type of crime?
Crime occurs in cities around the world. Although Vancouver is a relatively safe city, it is not an exception to this rule. In order to keep the city safe and enjoyable, Vancouver Police Department (VPD) is at the frontline and is the core of crime prevention and safety. One VPD initiative is to provide an informational data set about criminal incidents that occur in Vancouver to the public. For example, a recently released VPD annual crime statistics for 2019 contains a summary of various types of crime and the difference compared to the 2018 statistics. Some of the notable findings they reported was an increase in the city’s overall violent crime rate in 2019 by 7.2% compared to 2018, largely due to an increase in Assaults which saw an increase of 11.2% (Vancouver, Vancouver Police Department). Other notable changes were Break and Enter in Business Properties (+21.0%) and Theft from Auto (+12.9%). Other types of criminal activities saw lower rates when compared to 2018. For example, Break and Enter in Residential Areas went down by 10.5% (Vancouver, Vancouver Police Department). Criminal activities and their rates are dynamic and affected by multiple factors, therefore difficult to predict. Regardless, crime has a negative impact on individuals and society, and the more informed we are, the less opportunity for criminals and therefore fewer victims. Thus, this project will explore what type of criminal activities are prevalent in Vancouver based on when (time) and where (location) they have occured.
To answer the above question, we will use the VPD crime data set consisting of crime records from 2003 - 2019 in Vancouver. This data set consists of what types of crime have occurred along with relevant information regarding the crime. It includes categories to describe when the crime occured in terms of Year, Month, Day, Hour, and Minute. The crime locations are described by the Neighbourhood - specifying the various areas and districts in Vancouver, Hundred Block - indicating offset locations of the crime, and the X and Y coordinates representing UTM Zone 10 coordinates. From the various categories, the most relevant attributes of the crime seem to be Hour for time, and X and Y coordinates for location. By combining the main predictors of criminal activities, it may be possible to investigate and predict what type of criminal activities would occur. Such analysis would help individuals to take preventative measures and assist the VPD in making Vancouver a safe city.
# May take a while the first time
install.packages("GGally")
install.packages('e1071', dependencies=TRUE)
library(GGally)
library(tidyverse)
library(caret)
library(rvest)
library(repr)
URL <- "https://drive.google.com/uc?export=download&id=1mAjcCK4jQH8LW_rAeX2U3YZ4gO7nCcGe"
original_crime <- read_csv(URL)
# change type to factor instead of characters for later use in classification
original_crime <- original_crime %>%
mutate(TYPE = as.factor(TYPE)) %>%
mutate(HUNDRED_BLOCK = as.factor(HUNDRED_BLOCK)) %>%
mutate(NEIGHBOURHOOD = as.factor(NEIGHBOURHOOD))
head(original_crime)
By the end of our data analysis, we expect to find a reliable and consistent set of parameters that help distinguish and identify the different types of crime occurring in Vancouver, British Columbia. With these findings, we’d be able to successfully predict the potential crimes of a specific time and location-based environmental factors, as opposed to doing so through tedious manual observation techniques. This could be applied as a form of predictive policing, preventing crimes at their roots as opposed to dealing with the consequences. With this, however, we are left with the question of whether or not such statistical predictions are as good in practice as they are in theory, and if so can they keep up with the evolving trends of crime with the passage of time? Furthermore, can the cost of data collection and analysis exceed that of old-fashioned, manual labor?
There will only need to be some data wrangling to remove the NA's from the data, as seen in the summary table. The data set is already tidy because we are considering different time units, such as years and minutes, to be independent, not part of one time column in order to investigate their individual impact. From the correlation matrix, we can see that X, Y, minute, and hour are correlated so we would like to further investigate these attributes to classify the type of crime. Since hundred block and neighbourhood are factors, we cannot use them to predict but it would be interesting to see if there is a correlation between regions and a certain crime type. One way we will visualize the classification results is by making a bar chart of correctly classified vs incorrectly classified crimes across different neighbourhoods in order to clearly see if the next crime type can be accurately predicted for each region.
For each variable in this dataset, a dedicated table of the levels' distribution will be constructed. Depending on whether or not there seems to be a trend in these levels, a respective histogram will be constructed visualizing these trends. For quantitative variables of time showing an interesting pattern and sign of promise as a predictor, a histogram demonstrating each of the crime levels' variability with this given variable, with hopes of a definitive conclusion coming from such. For qualitative variables, a bar chart will be created to see if any of the few or many levels seem to surpass the others in crime cases, and if so will have a bar chart of crime type variety within this individual level. The x and y coordinates work hand in hand, so their ability to create a map through scatterplots will be of much use, and the most seemingly effective predictors will have a dedicated visualization to conclude with before progressing onto K nearest neighbours classification.
# types of crime
original_crime %>%
select(TYPE) %>%
unlist() %>%
levels()
Clearly, there is a very wide range of crimes that has occured in Vancouver throughout the years, and although many have a common, broad theme most differ a lot in nature. However, some can be grouped together so as not to clean up the dataset further, such as the two break and enter levels, the four types of theft, and vehicle collision or pedestrian stiking. Again, however, each category might differ uniquely with any respective predictor so it might not be safe to make such generalizations for now, given that certain predictors should not necessarily imply that if one specific crime were to occur under any particular circumstance, all other crimes under its respective umbrella are too, likely to occur. One important thing to consider is that, upon looking at the dataset itself, it appears that homicide and offence against a person have no records about them, so they cannot be factored into the final classification problem, and will be cleaned out of the dataset upon wrangling.
# Sampling the dataset since it is huge and will take too long to run
nrow(original_crime)
crime <- sample_n(original_crime, 150000) %>%
mutate(HUNDRED_BLOCK = as.factor(HUNDRED_BLOCK)) %>%
mutate(NEIGHBOURHOOD = as.factor(NEIGHBOURHOOD))
nrow(crime)
Table 0.0
# Crime count sorted by type of crime
num_obs <- nrow(crime)
type_dist <- crime %>%
group_by(TYPE) %>%
summarize(n = n(),
percentage = n() / num_obs * 100) %>%
arrange(desc(percentage))
type_dist
# Type of Crime Distribution
options(repr.plot.width=20, repr.plot.height=10)
type_dist_plot <- ggplot(crime, aes(x=TYPE, fill=TYPE)) +
geom_bar() +
labs(x="Crime", y="Frequency") +
ggtitle("1.1: Type of Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
type_dist_plot
Immediately, it is noticeable that theft from vehicle takes up a massive majority of all crime happening in Vancouver this century. As a matter of fact, of the crimes available in this dataset, more than an entire third of all crime is taken up by this particular one! Not only the difference between theft from vehicles and all other crimes but other types of theft in general means that perhaps it might not be a good idea to group them given their difference in weight over all crime in Vancouver. Also, it appears that residential break and enters happen almost double as frequently as commercial ones do, perhaps due to the feasibility of the crime given that homes tend to have lower security than more public locations would. Moreover, traffick related accidents with fatality and homicide, the two crimes concerning the death of a victim, both are relatively rare in contrast to all other crimes. Many of these records have faulty records, however, so before delving into each particular vector of the dataset, we must filter out data that may mislead our predictions. Furthermore, using the entire dataset might be too crowding for certain visualizations, so we must create a subset to be used when needed.
# Clean crime dataset
crime <- na.omit(crime)
map_df(crime, function(col) sum(is.na(col)))
nrow(crime)
# Minute of Crime Distribution
num_obs <- nrow(crime)
min_dist <- crime %>%
group_by(MINUTE) %>%
summarize(n = n(),
percentage = n() / num_obs * 100,
standard = n()/n())
head(min_dist)
Table 2.0
# Hour of crime distribution
num_obs <- nrow(crime)
hour_dist <- crime %>%
group_by(HOUR) %>%
summarize(n = n(),
percentage = n() / num_obs * 100,
standard = n()/n()) %>%
arrange(desc(n))
head(hour_dist)
options(repr.plot.width=20, repr.plot.height=10)
hour_dist_plot <- ggplot(crime, aes(x=HOUR)) +
geom_histogram(binwidth = 1) +
labs(x="Hour", y="Frequency") +
ggtitle("2.1: Hour of Crime Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
hour_dist_plot
# Hour of crime per type distribution
options(repr.plot.width=20, repr.plot.height=20)
hour_dist_per_crime <- ggplot(crime, aes(x=HOUR, fill=TYPE)) +
geom_histogram(binwidth = 1) +
labs(x="Hour", y="Frequency") +
ggtitle("2.2: Hour of Crime Per Type Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
facet_grid(TYPE ~ ., scales="free")
hour_dist_per_crime
# Night crime distribution
options(repr.plot.width=20, repr.plot.height=10)
night_crime <- filter(crime, HOUR >= 17, HOUR <= 19)
cbd_dist_plot <- ggplot(night_crime, aes(x=TYPE, fill=TYPE)) +
geom_bar() +
labs(x="Crime", y="Frequency") +
ggtitle("2.3: Night Type of Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
cbd_dist_plot
Table 3.0
# Day of crime distribution
num_obs <- nrow(crime)
day_dist <- crime %>%
group_by(DAY) %>%
summarize(n = n(),
percentage = n() / num_obs * 100,
standard = n()/n())
head(day_dist)
tail(day_dist)
options(repr.plot.width=20, repr.plot.height=10)
day_dist_plot <- ggplot(crime, aes(x=DAY)) +
geom_histogram(binwidth = 1) +
labs(x="Day", y="Frequency") +
ggtitle("3.1: Day of Crime Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
day_dist_plot
# Day of crime per type distribution
options(repr.plot.width=20, repr.plot.height=20)
day_dist_per_crime <- ggplot(crime, aes(x=DAY, fill=TYPE)) +
geom_histogram(binwidth = 1) +
labs(x="Day", y="Frequency") +
ggtitle("3.2: Day of Crime Per Type Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
facet_grid(TYPE ~ ., scales="free")
day_dist_per_crime
Table 4.0
# Month of crime distribution
num_obs <- nrow(crime)
month_dist <- crime %>%
group_by(MONTH) %>%
summarize(n = n(),
percentage = n() / num_obs * 100)%>%
arrange(desc(n))
head(month_dist)
month_dist_plot <- ggplot(crime, aes(x=MONTH)) +
geom_histogram(binwidth = 1) +
labs(x="Month", y="Frequency") +
ggtitle("4.1: Month of Crime Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
scale_x_discrete(limits=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
month_dist_plot
# Month of crime per type distribution
options(repr.plot.width=20, repr.plot.height=20)
month_dist_per_crime <- ggplot(crime, aes(x=MONTH, fill=TYPE)) +
geom_histogram(binwidth = 1) +
labs(x="Month", y="Frequency") +
ggtitle("4.2: Month of Crime Per Type Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
facet_grid(TYPE ~ ., scales="free")
month_dist_per_crime
Table 5.0
# Year of crime distribution
num_obs <- nrow(crime)
year_dist <- crime %>%
group_by(YEAR) %>%
summarize(n = n(),
percentage = n() / num_obs * 100)
head(year_dist)
tail(year_dist)
year_dist_plot <- ggplot(crime, aes(x=YEAR)) +
geom_histogram(binwidth = 1) +
labs(x="Year", y="Frequency") +
ggtitle("5.1: Year of Crime Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
year_dist_plot
# Year of crime per type distribution
options(repr.plot.width=20, repr.plot.height=20)
year_dist_per_crime <- ggplot(crime, aes(x=YEAR, fill=TYPE)) +
geom_histogram(binwidth = 1) +
labs(x="Year", y="Frequency") +
ggtitle("5.2: Year of Crime Per Type Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
facet_grid(TYPE ~ ., scales="free")
year_dist_per_crime
After exploring minute, hour, day, month, and year of crime as potential predictors, the one variable that stood out most was the progression of the different crimes throughout the hours of a day. To begin with, judging from the distribution of crimes listed under the dataframe in "Minutes", almost half of all crimes were recorded in the 0th minute, clearly for convenience. Furthermore, it is unlikely that all crimes are recorded down to the exact minute they occured. As for "day", given that weekdays and weekends are not indicated (as in, a 7 day basis) all days of the month had a more or less uniform spread of crimes, with the exception of the 31st, for obvious reasons (only half of all months contain it). Month and year showed some interesting trends in figures 4.2 and 5.2, but were barely distinguishable from crime to crime.
Crime variety by hour brings up numerous fascinating trends otherwise unclear by simply comparing frequency of crimes with one another. Firstly, over 20 percent of all crimes recorded in Vancouver this century occur between 5 and 7 pm. Given the business of streets and the dark settings by then, it makes sense that most crimes occur during this hour as that is when victims are most vulnerable and susceptible to criminal activity. At these hours, again, theft from vehicles leads in crimes once again. This lead is far beyond the second crime that follows, perhaps because of the lack of presence or reliability of witnesses in the approaching darkness and the element of stealth that an offender has attacking at night. Moreover, break and enter crimes, mischief, and the three specified types of crime all spike dramatically at midnight! Another prominent trend visible in the above visualizations is that non-vehical or bicycle related thefts peak during the day and trough during the night. Because thefts such as pickpocketing and such require the presence of crowds in high volumes, so this trend also makes sense. Vehicle collisions, likewise, are at a peak too during this time, as busy streets are the most dangerous to drive in afterall.
Table 6.0
# Crime count sorted by hundred-block
num_obs <- nrow(crime)
crime_by_block <- crime %>%
group_by(HUNDRED_BLOCK) %>%
summarize(n = n(),
percentage = n/num_obs * 100) %>%
arrange(desc(n))
head(crime_by_block)
summarize(crime_by_block, n=n())
Table 7.0
# Crime count sorted by neighbourhood
num_obs <- nrow(crime)
crime_by_neighbourhood <- crime %>%
group_by(NEIGHBOURHOOD) %>%
summarize(n = n(),
percentage = n/num_obs*100) %>%
arrange(desc(n))
crime_by_neighbourhood
# Neighbourhood Crime Distribution
options(repr.plot.width=20, repr.plot.height=10)
nbhd_dist_plot <- ggplot(crime, aes(x=NEIGHBOURHOOD)) +
geom_bar() +
labs(x="Neighbourhood", y="Frequency") +
ggtitle("6.1: Neighbourhood Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
nbhd_dist_plot
# Central Business district crime
options(repr.plot.width=20, repr.plot.height=10)
cbd_crime <- filter(crime, NEIGHBOURHOOD == "Central Business District")
cbd_dist_plot <- ggplot(cbd_crime, aes(x=TYPE, fill=TYPE)) +
geom_bar() +
labs(x="Crime", y="Frequency") +
ggtitle("6.2: Central Business District Type of Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
cbd_dist_plot
# Coordinate visualization
options(repr.plot.width = 17, repr.plot.height = 10)
coord_plot <- ggplot(crime, aes(x=X, y=Y, color=TYPE)) +
geom_point(alpha=0.3) +
labs(x="X coordinate", y="Y coordinate", color="Location of Crime") +
ggtitle("7.1: Crime Coordinates Plot") +
xlim(482500, 500000)+
ylim(5450000,5460000) +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
coord_plot
In the context of a knn classification algorithm, only one set of location predictors seemed likely after only looking at this dataset initially. Given the mechanics of locating the nearest datapoint identities based on a distance formula, the XY coordinate set was the only realistic option. To begin with, given the abundance of hundred blocks in this data set, 22,060 to be specific, it should'nt be used as a predictor given tha besides the level of specificity required to classify a crime based off of it, a knn classification method wouldn't be useful either as even if two blocks are adjacent to one another, the algorithm wouldn't consider them "close". Of all neighbourhoods in Vancouver, the Central Business District seems to be where most crimes occur (over a fifth of all crimes: Figure 6.1), which lies at the edge of Hastings street. This comes as no surprise, as this area is notorious for high levels of unemployment, homelessness, drug abuse, among other things. As with the entirety of the dataset, theft from vehicles stands out as the leading crime in the area, followed by other theft not including automobile or bicycle theft, and mischief close in third as seen in Figure 6.2.
When plotting different types of crimes on a coordinate system, mapping Vancouver with its crimes, we can see some trends that make a lot of sense with respect to the nature of crimes and their locations. To begin with, again, most crimes are concentrated around the downtown Central Business District, as we have seen in Figure 6.2, and by the Hastings area. Moreover, vehicle related crimes are patterned along main roads, as they should, and thefts fill in the blocks by downtown whereas break and enter/mischief do so with the blocks further down south, where neighbourhoods and rural housing areas reside. A faint map of central Vancouver can be seen as all these crimes come together, and it shows that although some areas are more susceptible to crime than others, no place is particularly safe.
# Summary plots
morning_crime <- filter(crime, HOUR >= 4, HOUR < 12)
day_crime <- filter(crime, HOUR >= 12, HOUR < 20)
night_crime <- filter(crime, HOUR >= 20 | HOUR < 4)
options(repr.plot.width = 11, repr.plot.height = 6)
morning_plot <- ggplot(morning_crime, aes(x=X, y=Y, color=TYPE)) +
geom_point(alpha=0.1) +
labs(x="X coordinate", y="Y coordinate", color="Location of Crime: Morning") +
ggtitle("7.2: Morning Crime Coordinates Plot") +
xlim(482500, 500000)+
ylim(5450000,5460000) +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
morning_plot
options(repr.plot.width = 11, repr.plot.height = 6)
day_plot <- ggplot(day_crime, aes(x=X, y=Y, color=TYPE)) +
geom_point(alpha=0.1) +
labs(x="X coordinate", y="Y coordinate", color="Location of Crime: Day") +
xlim(482500, 500000)+
ylim(5450000,5460000) +
ggtitle("7.3: Day Crime Coordinates Plot") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
day_plot
options(repr.plot.width = 11, repr.plot.height = 6)
night_plot <- ggplot(night_crime, aes(x=X, y=Y, color=TYPE)) +
geom_point(alpha=0.1) +
labs(x="X coordinate", y="Y coordinate", color="Location of Crime: Night") +
xlim(482500, 500000)+
ylim(5450000,5460000) +
ggtitle("7.4: Night Crime Coordinates Plot") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
night_plot
In order to have a dynamic predictor, we need to take into consideration both time and location of crimes as predictors classifying an unknown type of crime. Looking back at all time variables, the one that gave the most prominent differences from crime to crime is the hour of day. Seeing as different crimes have different patterns throughout the day, but more or less consistent ones beyond and beneath, this is definitely the most reliable predictor in terms of time. When looking at location in context of a knn classification method, the coordinatte system is the most reliable due to it literally mapping out crimes across Vancouver, and serving to identify a crime based on its direct distance to or from a specified location.
In order to combine the three factors used in our K nearest neighbours classifier, above (Figures 7.2, 7.3, 7.4) is the progression of crime across Vancouver from morning to night time, demonstrated by plotted coordinates so as to bring all variables together.
#Summarizing the Data
relabel <- function(TYPE) {
vector = c()
for (t in TYPE) {
if (t == "Break and Enter Commercial" || t == "Break and Enter Residential/Other")
vector <- c(vector, "B")
else if(t == "Mischief")
vector <- c(vector, "M")
else if(t == "Other Theft" || t == "Theft from Vehicle" || t == "Theft of Bicycle" || t == "Theft of Vehicle")
vector <- c(vector, "T")
else if(t == "Vehicle Collision or Pedestrian Struck (with Fatality)" || t == "Vehicle Collision or Pedestrian Struck (with Injury)")
vector <- c(vector, "V")
}
return (vector)
}
#relabel(TYPE)
grouped_crime <- crime %>%
mutate(TYPE = as.character(TYPE)) %>%
mutate(BROAD_TYPE = relabel(TYPE))
Table 8.0
#Grouped crime visualization
num_obs <- nrow(grouped_crime)
head(grouped_crime)
dist <- grouped_crime %>%
group_by(BROAD_TYPE) %>%
summarize(n = n(),
percentage = n() / num_obs * 100)
dist
options(repr.plot.width=20, repr.plot.height=10)
broad_type_dist_plot <- ggplot(grouped_crime, aes(x=BROAD_TYPE, fill=BROAD_TYPE)) +
geom_bar() +
labs(x="Grouped Crime", y="Frequency") +
ggtitle("8.1: Broad Type of Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
broad_type_dist_plot
# Grouped coordinate visualization
managable_rows1 <- grouped_crime %>%
select(BROAD_TYPE) %>%
unlist() %>%
createDataPartition(p=0.5, list = FALSE)
managable_set1 <- grouped_crime %>% slice(managable_rows1)
options(repr.plot.width = 17, repr.plot.height = 10)
grouped_coord_plot <- ggplot(managable_set1, aes(x=X, y=Y, color=BROAD_TYPE)) +
geom_point(alpha=0.5) +
labs(x="X coordinate", y="Y coordinate", color="Location of Crime") +
ggtitle("8.2: Grouped Crime Coordinates Plot") +
xlim(482500, 500000)+
ylim(5450000,5460000) +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
grouped_coord_plot
# Grouped hour of crime per type distribution
options(repr.plot.width=20, repr.plot.height=20)
grouped_hour_dist_per_crime <- ggplot(grouped_crime, aes(x=HOUR, fill=BROAD_TYPE)) +
geom_histogram(binwidth = 1) +
labs(x="Hour", y="Frequency") +
ggtitle("8.3: Hour of Crime Per Grouped Type Distribution") +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold")) +
facet_grid(BROAD_TYPE ~ ., scales="free")
grouped_hour_dist_per_crime
As for grouping the 9 recorded levels of the dataset, keeping them separate seems to be the more accurate option as similarly grouped levels do not necessarily share the same trends as their respective "counterparts" do, so generalizing them into joint categories would be misleading and hence lead to false classification of crime type. We can see in figure 8.1 that, due to the previously reigning theft from vehicles, theft is by far the leading crime in terms of frequency in Vancouver from 2003 to 2019. However, when comparing figure 8.2 with 7.1, we see that certain clear locations associated with particular types of crime are now a lot more feint, and trends are not as strong as they previously appeared. As for figure 8.3, when comparing it to 2.2, a lot less variety in trends is also clear. With ungrouped crimes, many different patterns appeared as hours of the day passed. Now, all seem to increase in just as similar a fashion as the next crime.
Despite grouping levels theoretically being a non ideal option, keeping them separate would be rather pointless if we cannot have a relatively accurate model, and grouping would most easily achieve ths purpose. Hence, this is the most feasible way of running such a model.
# Have to have even less values for the k-nn to run in a reasonable amount of time
crime <- sample_n(grouped_crime, 5000) %>%
mutate(BROAD_TYPE = as.factor(BROAD_TYPE))
#spilt up training set and testing set
training_rows<-select(crime,BROAD_TYPE)%>%
unlist()%>%
# unfortunately these categories are too small and need to be filtered out or else the model will complain
# about empty levels
droplevels("Homicide", "Offence Against a Person") %>%
createDataPartition(p=0.75,list= FALSE)
training_set <- crime %>% slice(training_rows)
testing_set<- crime %>% slice(-training_rows)
scale_transformer <- preProcess(training_set, method = c("center", "scale")) # scale the data since knn is sentive to the scale
training_set <- predict(scale_transformer, training_set)
testing_set <- predict(scale_transformer, testing_set)
head(training_set)
head(testing_set)
glimpse(training_set)
glimpse(testing_set)
#check whether the number of observations in training set and testing set are correct
predictors <-select(training_set,X,Y,HOUR)%>%
data.frame() # to get the predictors from training set as the form of data frame
Y_label <-select(training_set,BROAD_TYPE)%>%
unlist()%>% #to get the label as the form of a vector
droplevels("Homicide", "Offence Against a Person")
train_control <- trainControl(method = "cv", number = 5)
ks <- data.frame(k = seq(1,50))
choose_k <-train(x=predictors,y=Y_label,tuneGrid=ks,method="knn",trControl=train_control)
choose_k
k_accuracies <- choose_k$results %>%
select(k, Accuracy)
options(repr.plot.width = 15, repr.plot.height = 10)
choose_k_plot <-k_accuracies %>%
ggplot(aes(x=k,y=Accuracy))+
geom_point()+
geom_line()+
labs(title = "9.0 Accuracy of Model for Different Values of K")+
theme(text = element_text(size = 20))
choose_k_plot
From the graph, we see the accuracy is highest at k = 18, thus we choose 18 as optimal value of k.
final_k = data.frame(k = 18)
final_classifier<- train(x = predictors, y = Y_label, method = "knn", tuneGrid = final_k)
X_test <- testing_set %>%
select(X,Y,HOUR) %>%
data.frame()
Y_test <- testing_set %>%
select(BROAD_TYPE) %>%
unlist()
test_pred <- predict(final_classifier, X_test)
test_results <- confusionMatrix(test_pred, Y_test)
test_results
test_stat <- data.frame(obs = Y_test, pred = test_pred)
options(repr.plot.width=20, repr.plot.height=10)
obs_dist_plot <- ggplot(test_stat, aes(x=obs, fill=obs)) +
geom_bar() +
labs(x="Actual Crime", y="Frequency") +
ggtitle("9.1: Actual Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
obs_dist_plot
options(repr.plot.width=20, repr.plot.height=10)
pred_dist_plot <- ggplot(test_stat, aes(x=pred, fill=pred)) +
geom_bar() +
labs(x="Predicted Crime", y="Frequency") +
ggtitle("9.2: Predicted Crime Distribution") +
coord_flip() +
theme(axis.text=element_text(size=16),
axis.title=element_text(size=20,face="bold"),
title=element_text(size=25, face="bold"))
pred_dist_plot
From the plots above, we see that the abundance of theft from vehicle crimes in the dataset seems to bias the algorithm towards predicting most data points to be theft as opposed to their actual crime.
Overall, building off of what we were initially expecting, we have shown that there are two main categories of factors that impact crime in Vancouver - time and location. Through using the hour that a crime occurred, we were able to discuss which types of crimes are associated with these times, and thus makes them a good candidate for predictors in the k-nn classification algorithm. On the other hand, we have discussed location and it's the role it plays in being a predictor for the type of crime that has occurred. These results are somewhat expected from our introductory exploration, as well as our readings from the articles referenced below. However, there were many interesting results that came about, especially with figures such as 2.2 that allowed us to concretely see how the hour of the day impacts which crimes happen.
After testing this and seeing that keeping crimes separate gives only an accuracy of 40% is obtained and a grouped version returns 60%, despite its lack of reflection over reality this seems to be the more useful option seeing that an accuracy of 40% isn't really helpful in the long run. On the contrary, although 60% is not ideal, it does indeed give a better estimate to work with on a larger scale. We might use this tool to supplement other field based identifications of crime, such as in person investigation and witness reports, as it still helps relatively narrow down the crime but solely is insufficient.
Thus, to answer our question of "What types of crime occur in Vancouver based on time and location of criminal activities from 2003 - 2019 and what are the best predictors of a type of crime?" we can now say that theft is definitely the highest, accounting for more than half of the crimes that occur. The answer to our predictive question is more generally that time and location are the best predictors while we found that hour along with X and Y coordinate did fairly well.
As discussed in our expected outcomes and significance, we hope that these predictors of when crime will occur will help to funnel expensive resources into where they are needed most in order to be more effective when fighting crime. However, this leads to some future questions such as will these trends change over time? If so, what are more specific predictors that will give us insight into the trends that are occurring and in what other ways can we use data as a way of fighting crime? Lastly, our main limiting factor was with not being able to use all of the data due to resource and time constraints, so we would like to see a way to use all this data effectively and perhaps another algorithm that will allow us to do so in a timely fashion to derive more insights.
Andresen, Martin A., Shannon J. Linning, and Nick Malleson. "Crime at Places and Spatial Concentrations: Exploring the Spatial Stability of Property Crime in Vancouver BC, 2003-2013." Journal of Quantitative Criminology 33.2 (2017): 255-75. ProQuest. Web. 7 Apr. 2020.
Shannon J. Linning, Crime seasonality and the micro-spatial patterns of property crime in Vancouver, BC and Ottawa, ON, Journal of Criminal Justice, Volume 43, Issue 6, 2015, ISSN 0047-2352.
Vancouver. Vancouver Police Department. Year-End 2019 Key Performance Indicators Report. Vancouver Police Board, 20 February 2020. Web. 3 April 2020.